White Wine Quality by Fatih Kurt

This report explores a dataset containing quality and various other attributes of approximately 5k white wine samples.

Description of attributes:

  1. fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

  2. volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

  3. citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

  4. residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

  5. chlorides: the amount of salt in the wine

  6. free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

  7. total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

  8. density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

  9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

  10. sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

  11. alcohol: the percent alcohol content of the wine

    Output variable (based on sensory data):
  12. quality (score between 0 and 10)

Extended data set description is provided here.

Univariate Plots Section

## [1] 4898   12
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

Our data set consists of 4898 entries each with 12 variables / columns.

The graphs above show that most wines have a quality of 5 to 7, and only a few are rated 3 or 9. The second graph shows the same distribution on a logarithmic vertical scale.

In this project we will focus on the interaction of different variables in addition to the quality factor. However, the quality factor has too many classes, ranging from 3 to 9. To simplify it, I map the quality scores into 3 distinct quality classes:

  1. Lesser: qualities 3, 4, 5
  2. Average: quality 6
  3. Higher: qualities 7, 8, 9
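
The mapping into three classes can be sketched as follows. This is a minimal illustration in Python rather than the R used to produce this report; the function name `quality_class` is hypothetical and only the class boundaries come from the text.

```python
def quality_class(quality: int) -> str:
    """Map a quality score (3-9) to one of the three classes described in the text."""
    if not 3 <= quality <= 9:
        # Scores outside 3-9 do not occur in this data set.
        raise ValueError(f"quality outside the observed 3-9 range: {quality}")
    if quality <= 5:
        return "Lesser"
    if quality == 6:
        return "Average"
    return "Higher"


# Class assigned to every score that occurs in the data set.
classes = {q: quality_class(q) for q in range(3, 10)}
print(classes)
```

Keeping the mapping in one small function makes it easy to change the boundaries later, for example if quality 5 were to be counted as average.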

Let’s also look into the distribution of the other variables with respect to quality.

The distribution is approximately normal. There are a few outliers, some with very high values. To display the histogram more clearly, I filtered out outliers by keeping only values between the 0.01 and 0.99 quantiles.

Below is a summary of the variable fixed.acidity.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200
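
The quantile filter used for the histograms can be sketched like this. It is an illustrative Python version, not the report's R code: the helper names are hypothetical, the bounds are assumed to be inclusive, and the quantile uses linear interpolation, which is the default convention of R's `quantile()`.

```python
import math


def quantile(sorted_values, p):
    """Linear-interpolation quantile of an already sorted list (0 <= p <= 1)."""
    h = (len(sorted_values) - 1) * p
    lo = math.floor(h)
    hi = min(lo + 1, len(sorted_values) - 1)  # guard for p == 1
    return sorted_values[lo] + (h - lo) * (sorted_values[hi] - sorted_values[lo])


def quantile_filter(values, lower=0.01, upper=0.99):
    """Keep only values between the lower and upper quantiles (inclusive, an assumption)."""
    ordered = sorted(values)
    low_cut = quantile(ordered, lower)
    high_cut = quantile(ordered, upper)
    return [v for v in values if low_cut <= v <= high_cut]


# Synthetic example data, not taken from the wine data set.
sample = list(range(1, 101))
kept = quantile_filter(sample)
kept_all = quantile_filter(sample, 0, 1)
print(len(kept), len(kept_all))
```

With bounds of 0 and 1 the filter keeps every value, which is the setting used later for variables that needed no trimming.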

The distribution is approximately normal with a slight skew to the right. We will look into volatile.acidity to see if it affects the quality. There are quite a few outliers, which could also be due to the variable being skewed to the right. Many of the outliers seem to be in a reasonable range. To display the histogram more clearly, I filtered out outliers by keeping only values between the 0.01 and 0.99 quantiles.

Below is a summary of the variable volatile.acidity.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

The distribution is approximately normal. There are a few outliers, some with very high values. To display the histogram more clearly, I filtered out outliers by keeping only values between the 0.01 and 0.99 quantiles. However, there seems to be a bump around 0.5. This could be due to acidity being reported with rounded values.

Below is a summary of the variable citric.acid.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

The distribution is skewed to the right, but when plotted on a logarithmic scale it appears bimodal. There are very few outliers, some with very high values. Because I already plotted the histogram on a logarithmic x scale, I did not need further quantile filtering; I used a quantile filter between 0 and 1, which retains all values.

Below is a summary of the variable residual.sugar.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800
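
Binning on a logarithmic x scale amounts to taking log10 of each value before counting. The sketch below is an illustrative Python version (not the report's R code); the function name and the bin width of 0.25 log10 units are assumptions, and the input is just the first ten residual.sugar values from the structure listing, far too few to show the bimodal shape.

```python
import math
from collections import Counter


def log10_histogram(values, bin_width=0.25):
    """Count values in equal-width bins on a log10 scale.

    Returns a dict mapping each bin's lower edge (in log10 units) to its count.
    """
    counts = Counter()
    for v in values:
        if v <= 0:
            raise ValueError("a log scale needs strictly positive values")
        counts[math.floor(math.log10(v) / bin_width)] += 1
    return {index * bin_width: counts[index] for index in sorted(counts)}


# First ten residual.sugar values shown in the str() output.
first_rows = [20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5]
hist = log10_histogram(first_rows)
print(hist)
```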

The distribution is approximately normal. There are many outliers, which seem to form a roughly uniform distribution between 0.1 and 0.2. To display the histogram more clearly, I kept only values between the 0.01 and 0.95 quantiles. The upper quantile is lower than in the earlier plots because of this uniform region between 0.1 and 0.2.

Below is a summary of the variable chlorides.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

The distribution is approximately normal. There are a few outliers, some with very high values. To display the histogram more clearly, I filtered out outliers by keeping only values between the 0.01 and 0.99 quantiles.

Below is a summary of the variable free.sulfur.dioxide.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

The distribution is approximately normal. There are a few outliers. To display the histogram more clearly, I filtered out outliers by keeping only values between the 0.01 and 0.99 quantiles.

Below is a summary of the variable total.sulfur.dioxide.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

The distribution is approximately normal. There are very few outliers, some with very high values. To display the histogram more clearly, I filtered out outliers by keeping only values between the 0.01 and 0.99 quantiles.

Below is a summary of the variable density.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

The distribution is approximately normal. There are many outliers, most with reasonable values. To display the histogram more clearly, I filtered out outliers by keeping only values between the 0.01 and 0.99 quantiles.

Below is a summary of the variable pH.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The distribution is approximately normal. There are many outliers. To display the histogram more clearly, I filtered out outliers by keeping only values between the 0.01 and 0.99 quantiles.

Below is a summary of the variable sulphates.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

The distribution is approximately normal. There are no outliers, probably because wines have an accepted alcohol range of roughly 8 to 14. Therefore, I used a quantile filter between 0 and 1, which retains all values.

Below is a summary of the variable alcohol.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Univariate Analysis


What is the structure of your dataset?

The data set consists of around 5k observations (4898). Each observation has values for 11 different measured variables. In addition to these variables, each observation has a quality rating, giving 12 columns in total.

What is/are the main feature(s) of interest in your dataset?

Quality is our main feature to investigate. I would like to understand what factors could be affecting the quality of wine.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

The variables alcohol, density, volatile.acidity, and chlorides could help to explain what contributes to quality perception for white wines.

Did you create any new variables from existing variables in the dataset?

Yes. The most obvious candidate seems to be quality. Therefore, I created a new variable named quality.class with the mapping provided below:

  1. Lesser: qualities 3, 4, 5
  2. Average: quality 6
  3. Higher: qualities 7, 8, 9

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

The variable residual.sugar has a distribution skewed to the right. I plotted the data on a logarithmic scale to see if there was anything to note. The new plot was bimodal. The other variables mostly seem to have approximately normal distributions.

For most variables, I removed outliers by trimming at varying quantiles, with a lower bound between 0.01 and 0.05 and an upper bound between 0.95 and 0.99. This helped me to see the important data more clearly.

I also changed the number of bins in order to view the distributions better.

Bivariate Plots Section

Let’s first start with a test to see the correlation between all variables:

> The pairs alcohol - density(-0.8) and density - residual.sugar(0.8) seem to have exceptionally high correlations.

However we are more concerned with quality feature. Therefore, we will first look into other variables’ interaction with quality.

The highest correlations with quality seem to be those of alcohol (0.4) and density (-0.3). The variables volatile.acidity (-0.2), chlorides (-0.2), and total.sulfur.dioxide (-0.2) also seem to be correlated with quality, albeit much more weakly.
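
For reference, a pairwise correlation of this kind can be computed as below. This is a minimal Python sketch, assuming the matrix holds Pearson coefficients like the tests later in this section; the `pearson` helper is hypothetical, and the inputs are only the first ten alcohol and residual.sugar values from the structure listing, so the result is not comparable to the full-data figures.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# First ten rows as printed by str(); a toy sample, not the full data set.
alcohol = [8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11]
residual_sugar = [20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5]
r = pearson(alcohol, residual_sugar)
print(r)
```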

Primary Variables

  1. density
  2. alcohol

With a little jitter for quality, which is a discrete value, it seems there is a negative correlation between density and quality.

Boxplot shows this relation better, with distributions for quality.

Below you can also find numerical values for these box plots:

## ww$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9911  0.9925  0.9944  0.9949  0.9969  1.0001 
## -------------------------------------------------------- 
## ww$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9958  1.0004 
## -------------------------------------------------------- 
## ww$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9933  0.9953  0.9953  0.9972  1.0024 
## -------------------------------------------------------- 
## ww$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9876  0.9917  0.9937  0.9940  0.9959  1.0390 
## -------------------------------------------------------- 
## ww$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9906  0.9918  0.9925  0.9937  1.0004 
## -------------------------------------------------------- 
## ww$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0006 
## -------------------------------------------------------- 
## ww$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9897  0.9898  0.9903  0.9915  0.9906  0.9970
## 
##  Pearson's product-moment correlation
## 
## data:  density and quality
## t = -22.926, df = 4798, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3394837 -0.2884840
## sample estimates:
##        cor 
## -0.3142105

The correlation test gives a 95% confidence interval of [-0.34, -0.29] for the correlation between these variables.
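
The t statistic and confidence interval printed above can be reproduced from the estimate and the degrees of freedom alone. The sketch below is in Python rather than R and uses hypothetical helper names; it relies on the Fisher z-transformation, which is the asymptotic interval that R's `cor.test` reports for Pearson's r.

```python
import math
from statistics import NormalDist


def pearson_t(r, df):
    """t statistic for testing that the true correlation is zero."""
    return r * math.sqrt(df) / math.sqrt(1 - r * r)


def pearson_ci(r, df, level=0.95):
    """Confidence interval for Pearson's r via the Fisher z-transformation."""
    n = df + 2                      # df = n - 2 for a correlation test
    z = math.atanh(r)               # Fisher z-transform of the estimate
    se = 1.0 / math.sqrt(n - 3)     # approximate standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# Estimate and degrees of freedom from the density/quality test output.
t_stat = pearson_t(-0.3142105, 4798)
low, high = pearson_ci(-0.3142105, 4798)
print(t_stat, low, high)
```

The same two functions apply to the other correlation tests in this section by substituting their `cor` and `df` values.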

With a little jitter for quality, which is a discrete value, it seems there is a positive correlation between alcohol and quality.

Boxplot shows this relation better, with distributions for quality.

Below you can also find numerical values for these box plots:

## ww$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.35   11.00   12.60 
## -------------------------------------------------------- 
## ww$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## ww$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## ww$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## ww$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## ww$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## ww$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90
## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and quality
## t = 32.392, df = 4720, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4028330 0.4495123
## sample estimates:
##       cor 
## 0.4264566

The correlation test gives a 95% confidence interval of [0.40, 0.45] for the correlation between these variables.

Secondary Variables

  1. volatile.acidity
  2. chlorides
  3. total.sulfur.dioxide

## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and quality
## t = -11.36, df = 4778, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1896410 -0.1344307
## sample estimates:
##        cor 
## -0.1621628

The correlation test gives a 95% confidence interval of [-0.19, -0.13] for the correlation between these variables.

## 
##  Pearson's product-moment correlation
## 
## data:  chlorides and quality
## t = -17.025, df = 4790, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2653861 -0.2119856
## sample estimates:
##        cor 
## -0.2388664

The correlation test gives a 95% confidence interval of [-0.27, -0.21] for the correlation between these variables.

## 
##  Pearson's product-moment correlation
## 
## data:  total.sulfur.dioxide and quality
## t = -13.024, df = 4798, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2119747 -0.1573235
## sample estimates:
##        cor 
## -0.1847919

The correlation test gives a 95% confidence interval of [-0.21, -0.16] for the correlation between these variables.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Our primary variables alcohol and density have correlations of 0.43 and -0.31 with quality, respectively. The p-value for both is lower than 2.2e-16. This means that alcohol content tends to be higher in higher-quality wines, while density tends to be lower. Both variables were pointed out in the first part, and it seems our preliminary judgement about these variables being correlated with quality is plausible.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Our secondary variables volatile.acidity, chlorides, and total.sulfur.dioxide have correlations of -0.16, -0.24, and -0.18 with quality, respectively. All three variables seem to decrease in higher quality wines. Once again, our judgement about these variables seems to hold.

What was the strongest relationship you found?

The highest correlations seem to be between the following pairs:

  1. alcohol - density: -0.80
  2. density - residual.sugar: 0.83

Both correlations make sense: alcohol is less dense than water, and dissolved solids such as sugar increase density.

However, the strongest relationship involving quality is with alcohol (0.43).

Multivariate Plots Section

Let’s first look into the variables that we found to be strongly correlated with each other. I would also like to display the quality of each sample by applying a colour gradient to the data points.

## `geom_smooth()` using method = 'gam'

## `geom_smooth()` using method = 'gam'

In the density - alcohol graph we can see that the different qualities are spread fairly uniformly along the density axis (vertical). However, along the alcohol axis (horizontal), the frequency of higher qualities increases. We already saw this relationship earlier, so I am not going further into detail.

In the density - residual.sugar graph, there seems to be a difference in how the different qualities are distributed along the vertical axis (density). Perhaps this hints that higher quality wines tend to be of lower density. However, this could also be due to alcohol having an effect on density.

On the other hand, there seems to be a negative correlation between residual.sugar and alcohol. This makes sense, since to reach higher alcohol levels, most of the sugar needs to be fermented into alcohol. The strange thing about all these graphs is the three-way relationship that we will talk about later.

To eliminate effect of density on alcohol and vice versa, I will repeat above first two plots by dividing each by the other.

When we remove the residual.sugar effect from density by dividing density by normalized (0-1) residual sugar values, the relationship with alcohol is still there. Therefore, we can presume that most of the density - alcohol relationship is actually driven by alcohol. However, this relationship breaks down at higher alcohol levels, which hints that the effect of sugar on density becomes significant there.
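
The normalize-then-divide step can be sketched as follows. This is an illustrative Python version with hypothetical helper names, not the report's R code. Min-max scaling maps the smallest value to exactly 0, and the report does not say how that zero denominator was handled, so the small floor `eps` is purely an assumption of this sketch.

```python
def min_max(values):
    """Scale values linearly so the minimum becomes 0 and the maximum becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant sequence")
    return [(v - lo) / (hi - lo) for v in values]


def safe_ratio(numerators, denominators, eps=1e-6):
    """Element-wise ratio; denominators are floored at eps (an assumption, see above)."""
    return [n / max(d, eps) for n, d in zip(numerators, denominators)]


# First five density and residual.sugar values from the str() output.
density = [1.001, 0.994, 0.995, 0.996, 0.996]
residual_sugar = [20.7, 1.6, 6.9, 8.5, 8.5]
sugar_normalized = min_max(residual_sugar)
density_per_sugar = safe_ratio(density, sugar_normalized)
print(sugar_normalized)
```

Because the ratio grows very large wherever the normalized value is near 0, the choice of floor strongly affects how such a plot looks at the low end.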

However, the relationship with residual.sugar is mostly lost when we remove the alcohol effect the same way.

From elementary chemistry, we know there is a relationship between chlorides, pH, and acidity. I would like to draw a few plots regarding these variables. Strangely, however, this process created distinct groups of density / alcohol.normalized values.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

## 
##  Pearson's product-moment correlation
## 
## data:  pH and chlorides
## t = -6.3542, df = 4896, p-value = 2.285e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.11814666 -0.06259154
## sample estimates:
##         cor 
## -0.09043946

I am actually surprised that higher pH (lower acidity) does not correlate with a higher amount of chlorides, as I would expect. Therefore, I would like to look into chlorides by eliminating the effect of acids on pH.

Strangely, removing the effect of fixed.acidity improved the separation between the different quality classes. With this new graph we can say that higher quality wines tend to have lower pH / fixed.acidity ratios.

The graph above shows a scatter plot of residual.sugar vs. chlorides. Unlike bad quality [3,6) wines, the good quality (6,9] wines tend to have low levels of residual.sugar until they reach a certain level of chlorides. Chlorides, being a salt ingredient, are possibly compensated for with added sugars.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

There is a strong relationship between density and residual.sugar and between density and alcohol. However, this relationship is three-way: there is also a relationship between residual.sugar and alcohol, which makes the situation harder to understand. By eliminating each of the variables in turn, I have come to the conclusion that the core variable leading the other two is alcohol. Alcohol decreases density much more than sugar increases it. On the other hand, sugar levels are lower in wines with higher alcohol concentrations. This also makes sense: the more completely a wine is fermented, the less unfermented sugar we would find.

Were there any interesting or surprising interactions between features?

The three-way interaction between residual.sugar, alcohol, and density was interesting. The most surprising relationship I found was the one between chlorides and pH. I was really expecting to find more chlorides in higher pH wines, because as the pH nears the neutral point of 7, I would expect to find more chlorides than alkalines/acids.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I was not able to find any strong relationship involving quality. Therefore, I don’t think it would make much sense to build a model for this variable.


Final Plots and Summary

Plot One

Description One

This scatter plot shows the alcohol / residual.sugar distribution of various white wine samples. The colors denote the quality of the wine samples, with green being the highest quality (9) and red being the lowest quality (3). Similarly, the green line indicates the trendline of alcohol content for high quality (qualities 7, 8, and 9) wines, while the red one indicates the same for lower quality (qualities 3, 4, and 5) wines.

The plot is particularly expressive in a way that it shows the difference between alcohol correlation to residual sugar for lower quality wines and higher quality wines. The average alcohol ratio drops significantly for higher quality wines, while it drops little for lower quality ones.

Plot Two

Description Two

This histogram shows the distribution of residual.sugar in various white wine samples. The histogram, unlike all others, shows a bimodal distribution. The important aspect here is that higher quality wines are also uniformly distributed throughout this distribution in terms of their ratio to the total. This means that good and bad quality wine producers alike prefer to produce wines with either low or high concentrations of residual.sugar. From our residual.sugar vs. alcohol plot, we know that the average alcohol content of higher quality wines decreases with increasing residual.sugar, but there is no correlation between quality and residual.sugar, whereas there is a positive correlation between quality and alcohol. Therefore, we can again conclude that people like either high alcohol wines or sweet wines with around 10 g of sugar per liter.

Plot Three

Description Three

This scatter plot shows the density / alcohol.normalized vs. residual.sugar distribution for various white wine samples. The color of the points denotes the quality of the wine samples, with green being the highest quality (9) and red being the lowest quality (3). Similarly, the green line indicates the trendline for high quality (qualities 7, 8, and 9) wines, while the red one indicates the same for lower quality (qualities 3, 4, and 5) wines.

The plot is particularly expressive in that it shows the difference between the trendlines of higher and lower quality wines with increasing values of residual sugar. Throughout the graph, higher quality wines tend to have significantly lower density after removing the effect of alcohol on density. However, when residual sugar reaches 10 grams/liter or more, the trendlines for lower and higher quality wines converge. This means the metric causing the difference disappears at this level. I believe this is due to people not really caring about density when what they care about is a sweet wine, and that at those levels the core variable becomes residual sugar.


Reflection

The data set has 12 attributes, each of which is meaningful, and it provides enough material to explore whether there are useful relationships among the samples. The quality attribute also provides the means to comment on the effects of other variables on human perception of the product.

The data set has a sample size of 4898. This small sample size causes problems throughout data exploration. The lack of enough samples for a particular subgroup makes it difficult to draw conclusions on some potential findings that would otherwise be significant.

Data set is complete and consistent enough that it does not seem to have entries with missing values or meaningless values. This eases the data exploration. Throughout the data analysis process, the data showed consistent results in terms of completeness and distribution of data points.

The challenge for me during this data exploration process was to understand relationships between variables that wouldn’t easily give up their secrets. I failed to find correlations where I expected them most. On the other hand, I encountered them in places where I least expected them. Even though most of the relationships could be more or less explained, they stayed mostly hidden until I reached them by visualizing them.

Even though the process of finding a useful model or relationship for quality seems trivial, after a few steps it becomes apparent that quality is a much more complex phenomenon and human perception depends on a more complex set of variables. At this point, it seems to me that it would be useful to know what exactly made the taster score the wine as high or low quality. For instance, it could have been noted that the taster liked its sweetness, acidity, alcohol content, etc.